t-SNE overview

Basic information

  • t-SNE: t-distributed stochastic neighbor embedding

  • Method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map

  • Computational complexity of a direct implementation: O(N²)

  • Original paper - Laurens van der Maaten, Geoffrey Hinton, 2008

t-SNE is based on Stochastic Neighbor Embedding (SNE), originally developed by Sam Roweis and Geoffrey Hinton; Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

Advantages                             Disadvantages
Handles non-linear data efficiently    Computationally complex
Preserves local and global structure   Non-deterministic
                                       Requires hyperparameter tuning

Algorithm

Example dataset

Let’s assume that we have 5 datapoints (“A”, “B”, “C”, “D”, “E”) embedded in 3-dimensional space. Our objective is to embed these points in the 2-dimensional space so that the local structure is preserved.

x y z
A 1.00 0.90 0.10
B 1.10 1.00 0.20
C 1.05 1.05 -0.10
D 1.50 1.02 0.10
E 1.40 1.01 0.15

Step 1 - calculate the distances between points

First, we calculate the Euclidean distances between the points in the high-dimensional space:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

A B C D E
A 0.000 0.173 0.255 0.514 0.418
B 0.173 0.000 0.308 0.413 0.304
C 0.255 0.308 0.000 0.493 0.432
D 0.514 0.413 0.493 0.000 0.112
E 0.418 0.304 0.432 0.112 0.000
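The distance matrix above can be reproduced with a short NumPy sketch (the coordinates come from the example dataset):

```python
import numpy as np

# Coordinates of the five example points (A-E) in 3-dimensional space
X = np.array([
    [1.00, 0.90,  0.10],  # A
    [1.10, 1.00,  0.20],  # B
    [1.05, 1.05, -0.10],  # C
    [1.50, 1.02,  0.10],  # D
    [1.40, 1.01,  0.15],  # E
])

# Pairwise Euclidean distances: D[i, j] = ||x_i - x_j||
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 3))  # matches the distance table above
```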

Step 2 - calculate the probability distribution in high-dim space

For each point in our dataset we calculate the similarity to the other points, which is described in the original paper as follows: the similarity of datapoint x_j to datapoint x_i is the conditional probability p_{j|i} that x_i would pick x_j as its neighbor. Note that this definition is not symmetric, i.e. p_{j|i} ≠ p_{i|j}.

Let’s pick one point, e.g. “A” and calculate the aforementioned conditional probabilities with respect to point “A”:

x
Dist A-A 0.000
Dist A-B 0.173
Dist A-C 0.255
Dist A-D 0.514
Dist A-E 0.418

We map the distances to a Gaussian distribution with mean 0. The probability that point “A” will pick point “B” as its neighbor is equal to the density of the normal distribution N(0, σ_A) evaluated at “Dist A-B”.

Additionally, we assume that p_{i|i} = 0 by definition. After we calculate the densities for all points, we normalize the values in order to obtain a probability distribution. Overall, the probability that point i will pick point j is given by the formula:

p_{j|i} := \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}

The remaining issue is how to determine the σ_i values of the normal distributions (cf. Appendix 1).

Original dataset (high-dim)
Point A Point B Point C Point D Point E
distance (euclidean) 0 0.173 0.255 0.514 0.418
probability 0 0.752 0.701 0.470 0.563
norm. probability 0 0.302 0.282 0.189 0.226

Now, we have the similarity of datapoint “A” to the other datapoints defined as the conditional probability p_{X|A} (“A” would pick X as its neighbor):

PA:
p_{B|A} = 0.302
p_{C|A} = 0.282
p_{D|A} = 0.189
p_{E|A} = 0.226
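The distribution for point “A” can be reproduced with a small sketch. The bandwidth σ_A = 0.5 is an assumed value here (it happens to reproduce the normalized probabilities in the table above):

```python
import numpy as np

def conditional_p(D, i, sigma):
    """Conditional distribution p_{j|i}: Gaussian kernel with bandwidth
    sigma applied to the i-th row of a distance matrix D; p_{i|i} = 0."""
    p = np.exp(-D[i] ** 2 / (2 * sigma ** 2))
    p[i] = 0.0               # a point never picks itself as a neighbor
    return p / p.sum()       # normalize to a probability distribution

# Distances from point A to A, B, C, D, E (from the table in Step 1)
D = np.array([[0.000, 0.173, 0.255, 0.514, 0.418]])
print(np.round(conditional_p(D, 0, sigma=0.5), 3))
# → [0.    0.302 0.282 0.189 0.226]
```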

Let’s calculate all 5 distributions (one distribution per datapoint):

Conditional distributions (high-dim)
Point A Point B Point C Point D Point E
P_A 0.000 0.302 0.282 0.189 0.226
P_B 0.285 0.000 0.250 0.215 0.251
P_C 0.292 0.275 0.000 0.204 0.229
P_D 0.204 0.246 0.213 0.000 0.337
P_E 0.221 0.260 0.215 0.305 0.000

Step 3 - calculate probability distribution in low-dim space

Let’s do exactly the same thing in low-dimensional space by performing the following steps:

  • we randomly select the coordinates of all datapoints in low-dimensional space
  • instead of the normal distribution we apply the Student’s t-distribution

Point A Point B Point C Point D Point E
distance (euclidean) 0 0.130 0.366 0.248 0.227
probability 0 0.771 0.610 0.706 0.720
norm. probability 0 0.275 0.217 0.252 0.257

Overall, the probability that point i will pick point j in the low-dimensional space (with coordinates denoted by y) uses the Student’s t kernel with one degree of freedom:

q_{j|i} := \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq i} (1 + \lVert y_i - y_k \rVert^2)^{-1}}
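A matching sketch for the low-dimensional similarities, using the heavy-tailed Student-t kernel with one degree of freedom (the randomly initialized coordinates Y below are placeholders, as in the step above):

```python
import numpy as np

def conditional_q(Y, i):
    """Conditional distribution q_{j|i}: Student-t kernel (one degree of
    freedom) applied to the low-dim distances from point i; q_{i|i} = 0."""
    d2 = ((Y - Y[i]) ** 2).sum(axis=1)  # squared distances to point i
    q = 1.0 / (1.0 + d2)                # heavy-tailed t kernel
    q[i] = 0.0                          # a point never picks itself
    return q / q.sum()                  # normalize to a distribution

# Randomly selected 2-D coordinates for the five points (Step 3)
rng = np.random.default_rng(42)
Y = rng.normal(scale=0.1, size=(5, 2))
print(np.round(conditional_q(Y, 0), 3))
```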

Step 4 - train like a supervised model

Now, for each point we have two distributions: in low-dim space (Q) and high-dim space (P).

Final table: P- and Q-distributions for all points
Point A Point B Point C Point D Point E
P_A 0.000 0.302 0.282 0.189 0.226
Q_A 0.000 0.275 0.217 0.252 0.257
P_B 0.285 0.000 0.250 0.215 0.251
Q_B 0.254 0.000 0.232 0.256 0.258
P_C 0.292 0.275 0.000 0.204 0.229
Q_C 0.216 0.249 0.000 0.268 0.267
P_D 0.204 0.246 0.213 0.000 0.337
Q_D 0.232 0.256 0.250 0.000 0.262
P_E 0.221 0.260 0.215 0.305 0.000
Q_E 0.236 0.256 0.247 0.261 0.000

Now, the idea is to change the coordinates in low-dimensional space so that the Q_i distributions become similar to the P_i ones; this way the local structure of the high-dimensional space is reflected in the low-dimensional space. Effectively, we tackle the problem of optimizing parameters, just like in a supervised model:

  • Parameters to be trained: coordinates of points in low-dimensional space
  • Actual values: P-distributions
  • Optimizer: stochastic gradient descent
  • Loss function: Kullback–Leibler divergence:

D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}

or, using the previous notation (for point A):

D_{KL}(P_A \| Q_A) = \sum_{j \in \{B,C,D,E\}} p_{j|A} \log_2 \frac{p_{j|A}}{q_{j|A}}

For all points in the dataset:

D_{KL}(P \| Q) = \sum_{i \in \{A,B,C,D,E\}} \sum_{j \in \{A,B,C,D,E\}} p_{j|i} \log_2 \frac{p_{j|i}}{q_{j|i}}

In practice, we can also define the symmetrized probabilities p_{ij} := \frac{p_{j|i} + p_{i|j}}{2n} and change the loss function to the following form:

D_{KL}(P \| Q) = \sum_{i \in \{A,B,C,D,E\}} \sum_{j \in \{A,B,C,D,E\}} p_{ij} \log_2 \frac{p_{ij}}{q_{ij}},

where p_{ij} = p_{ji} and q_{ij} = q_{ji}.
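The optimization described above can be written as a plain gradient-descent loop. This is a minimal illustration, not a full t-SNE implementation (no perplexity calibration, momentum, or early exaggeration), and the gradient used is the standard one for symmetric t-SNE with natural logarithms, which differs from the log₂ loss only by a constant factor:

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence between the joint distributions P and Q."""
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / (Q[mask] + eps)))

def tsne_step(Y, P, lr=0.1):
    """One gradient-descent update of the low-dim coordinates Y.
    Symmetric t-SNE gradient for point i:
    4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1"""
    diff = Y[:, None, :] - Y[None, :, :]        # pairwise differences
    W = 1.0 / (1.0 + (diff ** 2).sum(axis=-1))  # t-kernel weights
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                             # joint low-dim distribution
    grad = 4.0 * ((P - Q) * W)[:, :, None] * diff
    return Y - lr * grad.sum(axis=1), kl_divergence(P, Q)

# Symmetrize the conditional P-distributions from Step 2: p_ij = (p_{j|i} + p_{i|j}) / 2n
cond = np.array([
    [0.000, 0.302, 0.282, 0.189, 0.226],
    [0.285, 0.000, 0.250, 0.215, 0.251],
    [0.292, 0.275, 0.000, 0.204, 0.229],
    [0.204, 0.246, 0.213, 0.000, 0.337],
    [0.221, 0.260, 0.215, 0.305, 0.000],
])
P = (cond + cond.T) / (2 * len(cond))

rng = np.random.default_rng(0)
Y = rng.normal(scale=0.1, size=(5, 2))          # random initial coordinates
for step in range(200):
    Y, loss = tsne_step(Y, P)
print(round(loss, 4))                           # KL divergence after 200 steps
```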

Final output

For technical reasons, the original dataset was extended by adding some noise.

The results in 2-dim space are the following:

Examples

Iris

Iris dataset
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
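The Iris embedding can be produced end to end with scikit-learn's TSNE (assuming scikit-learn is installed; the hyperparameter values below are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
# perplexity roughly controls the effective number of neighbors per point
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(iris.data)
print(emb.shape)  # (150, 2): one 2-D point per Iris sample
```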

MNIST

Appendix